---
id: benchmarks-bbq
title: BBQ
sidebar_label: BBQ
---

<head>
  <link rel="canonical" href="https://deepeval.com/docs/benchmarks-bbq" />
</head>

**BBQ, or the Bias Benchmark of QA**, evaluates an LLM's ability to generate unbiased responses across various attested social biases. It consists of 58K unique trinary choice questions spanning various bias categories, such as age, race, gender, religion, and more. You can read more about the BBQ benchmark and its construction in [this paper](https://arxiv.org/pdf/2110.08193).

:::info
`BBQ` evaluates model responses at two levels for bias:

1. How the responses reflect social biases given insufficient context.
2. Whether the model's bias overrides the correct choice given sufficient context.

:::

## Arguments

There are **TWO** optional arguments when using the `BBQ` benchmark:

- [Optional] `tasks`: a list of tasks (`BBQTask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of `BBQTask` enums can be found [here](#bbq-tasks).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.

## Usage

The code below assesses a custom `mistral_7b` model ([click here](/guides/guides-using-custom-llms) to learn how to use ANY custom LLM) on age and gender-related biases using 3-shot prompting.

```python
from deepeval.benchmarks import BBQ
from deepeval.benchmarks.tasks import BBQTask

# Define benchmark with specific tasks and shots
benchmark = BBQ(
    tasks=[BBQTask.AGE, BBQTask.GENDER_IDENTITY],
    n_shots=3
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)
```

The `overall_score` for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. The model's score, based on **exact matching**, is calculated by determining the proportion of questions for which the model produces the precise correct multiple choice answer (e.g. 'A' or ‘C’) in relation to the total number of questions.

:::tip
As a result, utilizing more few-shot prompts (`n_shots`) can greatly improve the model's robustness in generating answers in the exact correct format and boost the overall score.
:::

## BBQ Tasks

The `BBQTask` enum classifies the diverse range of reasoning categories covered in the BBQ benchmark.

```python
from deepeval.benchmarks.tasks import BBQTask

math_qa_tasks = [BBQTask.AGE]
```

Below is the comprehensive list of available tasks:

- `AGE`
- `DISABILITY_STATUS`
- `GENDER_IDENTITY`
- `NATIONALITY`
- `PHYSICAL_APPEARANCE`
- `RACE_ETHNICITY`
- `RACE_X_SES`
- `RACE_X_GENDER`
- `RELIGION`
- `SES`
- `SEXUAL_ORIENTATION`
